Abstract

We consider high-dimensional inference when the assumed linear model is misspecified. We describe some correct interpretations and corresponding sufficient assumptions for valid asymptotic inference of the model parameters, which still have a useful meaning when the model is misspecified. We largely focus on the de-sparsified Lasso procedure but we also indicate some implications for (multiple) sample splitting techniques. In view of available methods and software, our results contribute to robustness considerations with respect to model misspecification.


1 Introduction 


The construction of confidence intervals and statistical hypothesis tests is a primary goal for assessing uncertainty in high-dimensional inference. Most of the recent contributions for this task discuss some methods and approaches for high-dimensional linear models (Bühlmann, 2013; Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014; Meinshausen, 2015; Foygel Barber and Candès, 2014), but generalized linear models (Meinshausen et al., 2009; Minnier et al., 2011; van de Geer et al., 2014), undirected graphical models (Ren et al., 2015; Janková and van de Geer, 2014), instrumental variable models (Belloni et al., 2012) or very general models (Meinshausen and Bühlmann, 2010) have been considered as well, and all of these latter references cover linear models as a special case. Another philosophy for inference in the high-dimensional setting is based on selective inference (Benjamini and Yekutieli, 2005; Lockhart et al., 2014; Taylor et al., 2014), but we do not consider this here. Our goal is to interpret and analyze the meaning of inference procedures when the linear model is misspecified. We address this issue in greater detail for the de-sparsified (or de-biased) Lasso (Zhang and Zhang, 2014), but we make a few more general comments in Section 6.1.


More concretely, we describe the correct interpretations and corresponding (sufficient) assumptions which guarantee valid asymptotic inference for the parameters in a high-dimensional, misspecified linear model. That is, we assume that the data are generated from an underlying true nonlinear model $Y = f(X) + \xi$ but we fit the wrong linear model $Y = X\beta^0 + \varepsilon$ to the data; see for example Wasserman (2014) who describes such settings as “weak modeling”. Precise definitions of the models are given later. Some arising questions are: first, what is the interpretation of $\beta^0$; and secondly, is the standard de-sparsified Lasso procedure valid for construction of statistical hypothesis tests and confidence intervals for the components $\beta_j^0$ $(j = 1, \ldots, p)$? Regarding the first issue, it is important to distinguish between random and fixed design scenarios. Regarding the second point, we do give sufficient conditions for asymptotic correctness of the de-sparsified Lasso procedure, although for the random design case, one has to estimate the asymptotic variance differently than for correctly specified models.

The novelty of this work is that we explicitly discuss the implications of linear model misspecification for construction of confidence intervals and hypothesis testing in high dimensions. We believe that this is a missing piece which should be addressed and which is informally often treated according to the folklore that the procedure leads to inference for the “best projected regression parameters”: we make this precise and also show that some modifications are necessary for the random design case (see above). The latter are implemented in the statistical R-software package hdi (Meier et al., 2014) which includes various methods for frequentist high-dimensional inference (Dezeure et al., 2014).


2 The de-sparsified Lasso for potentially misspecified linear models


We consider $n$ data points $(Y^{(1)}, X^{(1)}), \ldots, (Y^{(n)}, X^{(n)})$ with univariate responses $Y^{(i)}$ and $p$-dimensional covariables $X^{(i)}$. We denote by $Y = (Y^{(1)}, \ldots, Y^{(n)})^T$ and $X_j = (X_j^{(1)}, \ldots, X_j^{(n)})^T$ $(j = 1, \ldots, p)$ the $n \times 1$ vectors, and by $X = (X_1, \ldots, X_p)$ the $n \times p$ design matrix.

We fit a potentially misspecified linear model

$$Y = X\beta^0 + \varepsilon, \tag{1}$$

where the model assumptions are as follows: i.i.d. distributed rows of $X$ (if random), and i.i.d. components of $\varepsilon$ having mean zero, variance $\sigma_\varepsilon^2$ and which are uncorrelated from $X$. In a misspecified setting, the meaning of the parameter vector $\beta^0$ and of the errors $\varepsilon$ depends on the context, in particular whether the design is random or fixed. The different interpretations are presented in Sections 3 and 4 below.

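For constructing confidence intervals and hypothesis tests for the individual parameters $\beta_j^0$ $(j = 1, \ldots, p)$, we consider the de-sparsified Lasso, originally proposed by Zhang and Zhang (2014). The procedure is as follows. First, do a Lasso (Tibshirani, 1996) or square root Lasso (Belloni et al., 2011) regression fit of $X_j$ versus all other variables from $X_{-j}$, the $n \times (p-1)$ design matrix whose columns correspond to the variables $\{X_k;\ k \neq j\}$. That is, for the Lasso,

$$\hat\gamma_j = \mathrm{argmin}_{\gamma \in \mathbb{R}^{p-1}} \left( \|X_j - X_{-j}\gamma\|_2^2/n + \lambda_X \|\gamma\|_1 \right), \tag{2}$$

or using the square root Lasso,

$$\hat\gamma_j = \mathrm{argmin}_{\gamma \in \mathbb{R}^{p-1}} \left( \|X_j - X_{-j}\gamma\|_2/\sqrt{n} + \lambda_X \|\gamma\|_1 \right). \tag{3}$$

The residuals of such a regression are denoted by $Z_j = X_j - X_{-j}\hat\gamma_j$.

We then project the response $Y$ onto this residual vector: if the model (1) were correct, we have

$$\frac{Z_j^T Y}{Z_j^T X_j} = \beta_j^0 + \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\,\beta_k^0 + \frac{Z_j^T \varepsilon}{Z_j^T X_j}.$$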


This suggests a bias correction as follows. Pursue a Lasso regression of $Y$ versus $X$:

$$\hat\beta = \mathrm{argmin}_{\beta} \left( \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_1 \right),$$

plug it into the bias term and subtract the estimated bias. This leads to the de-sparsified Lasso estimator:

$$\hat b_j = \frac{Z_j^T Y}{Z_j^T X_j} - \sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\,\hat\beta_k. \tag{4}$$
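The construction of the estimator $\hat b_j = Z_j^T Y / Z_j^T X_j - \sum_{k \neq j} (Z_j^T X_k / Z_j^T X_j)\hat\beta_k$ can be sketched in a few lines. The following is a minimal illustration only, assuming a plain coordinate-descent Lasso as a stand-in solver; it is not the routine used in the R package hdi, and the function names (`lasso_cd`, `desparsified_lasso`) and the toy data are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(columns, y, lam, sweeps=200):
    """Minimise ||y - X b||_2^2 / n + lam * ||b||_1 by coordinate descent.

    `columns` stores the design as a list of columns, each of length n.
    """
    n = len(y)
    coef = [0.0] * len(columns)
    resid = list(y)  # current residual y - X b
    for _ in range(sweeps):
        for k, col in enumerate(columns):
            norm_sq = sum(v * v for v in col) / n
            # inner product of column k with the partial residual (coefficient k removed)
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm_sq * coef[k]
            new = soft_threshold(rho, lam / 2.0) / norm_sq
            delta = new - coef[k]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                coef[k] = new
    return coef


def desparsified_lasso(columns, y, j, lam, lam_x):
    """De-sparsified Lasso estimate for coordinate j, using a nodewise Lasso residual."""
    others = [col for k, col in enumerate(columns) if k != j]
    gamma = lasso_cd(others, columns[j], lam_x)          # nodewise Lasso of X_j on X_{-j}
    z = [xj - sum(g * col[i] for g, col in zip(gamma, others))
         for i, xj in enumerate(columns[j])]             # residual vector Z_j
    beta = lasso_cd(columns, y, lam)                     # initial Lasso fit of Y on X
    z_xj = sum(a * b for a, b in zip(z, columns[j]))
    z_y = sum(a * b for a, b in zip(z, y))
    bias = sum(beta[k] * sum(a * b for a, b in zip(z, col))
               for k, col in enumerate(columns) if k != j)
    return (z_y - bias) / z_xj


# Toy check (hypothetical data): noise-free linear response, nodewise penalty set to zero.
random.seed(0)
n_demo, p_demo = 60, 4
cols_demo = [[random.gauss(0.0, 1.0) for _ in range(n_demo)] for _ in range(p_demo)]
y_demo = [2.0 * cols_demo[0][i] - 1.0 * cols_demo[2][i] for i in range(n_demo)]
b0_demo = desparsified_lasso(cols_demo, y_demo, j=0, lam=0.1, lam_x=0.0)
```

With `lam_x = 0` and $n > p$, the residual $Z_j$ is orthogonal to all other columns, so the bias term vanishes and the sketch recovers the least squares coefficient; in the high-dimensional case a positive `lam_x` is needed and the bias correction does real work.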


From the construction and assuming that model (1) is correct, we heuristically obtain:

$$\frac{Z_j^T X_j}{\sqrt{n\,\omega_{p;jj}}}\,(\hat b_j - \beta_j^0) = \sum_{k \neq j} \frac{Z_j^T X_k}{\sqrt{n\,\omega_{p;jj}}}\,(\beta_k^0 - \hat\beta_k) + \frac{Z_j^T \varepsilon}{\sqrt{n\,\omega_{p;jj}}} \approx \frac{Z_j^T \varepsilon}{\sqrt{n\,\omega_{p;jj}}} \Rightarrow \mathcal{N}(0,1),$$

where we assume for the first approximation that the error in estimating the bias is negligible, and where $\omega_{p;jj}$ is the asymptotic variance of $Z_j^T \varepsilon/\sqrt{n}$. This reasoning has been made rigorous in earlier work, assuming some conditions (Zhang and Zhang, 2014; van de Geer et al., 2014). When the model (1) is wrong, however, the heuristics above need to be justified anew. Also from a practical point of view, we need to characterize the meaning of $\beta^0$ and we need to determine the correct specification of $\omega_{p;jj}$ in order to construct asymptotically correct confidence intervals and tests. The details are described in the following Sections 3 and 4.


The procedure for the de-sparsified Lasso $\hat b_j$ in (4) remains (essentially) the same regardless of whether the linear model is correct or not. Referring to the parenthesis in the previous sentence, what potentially changes relative to a correctly specified model is the proper asymptotic variance $\omega_{p;jj}$, see Section 3.1, and this new feature is now also implemented in the R-software package hdi (Meier et al., 2014).


Throughout the paper, the asymptotic statements are for the setting where the dimension $p = p_n$ is allowed to depend on $n$ (and hence also the random variables in the model), and we consider the behavior as $n \to \infty$, typically with $p = p_n \to \infty$ at a much faster rate than $n$. We often suppress the index $n$ in the notation.


3 Random design model 

Consider the true model

$$Y^{(0)} = f^0(X^{(0)}) + \xi^{(0)}, \tag{5}$$

where $\xi^{(0)}$ is independent of $X^{(0)}$ with $E[\xi^{(0)}] = 0$. For simplicity, we assume that $E[f^0(X^{(0)})] = 0$ as well as $E[X^{(0)}] = 0$, and that furthermore the second moments of $X^{(0)}$ and $Y^{(0)}$ exist. We assume that the data are realizations of $(Y^{(1)}, X^{(1)}), \ldots, (Y^{(n)}, X^{(n)})$ of i.i.d. copies of $(Y^{(0)}, X^{(0)})$ from model (5).

Consider the linear projection

$$Y^{(0)} = (X^{(0)})^T \beta^0 + \varepsilon^{(0)}, \qquad \beta^0 = \mathrm{argmin}_{\beta}\, E|f^0(X^{(0)}) - (X^{(0)})^T \beta|^2, \tag{6}$$

where, due to the projection property, $E[\varepsilon^{(0)} X^{(0)}] = \mathrm{Cov}(\varepsilon^{(0)}, X^{(0)}) = 0$. We denote the support of $\beta^0$ by $S_0 = \{j;\ \beta_j^0 \neq 0\}$. While $E[\varepsilon^{(0)}] = 0$ we typically have that $E[\varepsilon^{(0)} | X^{(0)}] \neq 0$, because $E[\varepsilon^{(0)} | X^{(0)}] = f^0(X^{(0)}) - (X^{(0)})^T \beta^0$. Thus, when conditioning on $X^{(0)}$, the assumption of zero mean for the error is not valid. However, when the inference for $\beta^0$ is unconditional (not conditioning on $X^{(0)}$) then we have zero mean for the error: therefore, due to model misspecification, the inference with random design should always be unconditional on $X$.
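A small worked example may help to see why the projection error has mean zero unconditionally but not conditionally. It is a hypothetical one-dimensional illustration (not one of the models studied later), assuming $p = 1$, $X^{(0)} \sim \mathcal{N}(0,1)$ and $f^0(x) = x^2 - 1$:

```latex
\begin{align*}
\beta^0 &= \frac{E[f^0(X^{(0)})\,X^{(0)}]}{E[(X^{(0)})^2]}
         = E[(X^{(0)})^3] - E[X^{(0)}] = 0, \\
\varepsilon^{(0)} &= Y^{(0)} - X^{(0)}\beta^0 = (X^{(0)})^2 - 1 + \xi^{(0)}, \\
E[\varepsilon^{(0)} X^{(0)}] &= E[(X^{(0)})^3] - E[X^{(0)}] + E[\xi^{(0)}]\,E[X^{(0)}] = 0, \\
E[\varepsilon^{(0)} \mid X^{(0)}] &= (X^{(0)})^2 - 1 \neq 0 \quad \text{(almost surely)}.
\end{align*}
```

Here the error is uncorrelated with the covariable, as guaranteed by the projection property, yet its conditional mean is a nontrivial function of $X^{(0)}$.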

We note that $\beta^0$ still has interesting model-free (and well known) interpretations such as: the $j$th component $\beta_j^0 = L_j \cdot \mathrm{Parcorr}(Y^{(0)}, X_j^{(0)} \,|\, \{X_k^{(0)};\ k \neq j\})$ equals the partial correlation between $Y^{(0)}$ and $X_j^{(0)}$ given all other variables, up to a constant $L_j = \sqrt{K_{jj}/K_{YY}}$, where $K^{-1}$ is the $(p+1) \times (p+1)$ covariance matrix of $(Y, X)$; thus, $\beta_j^0$ measures the linear effect of $X_j$ on $Y$ after adjusting for the linear effects of all other variables $X_k$ $(k \neq j)$ on $Y$. In addition, for Gaussian design, we have the following important interpretation: if $\beta_j^0 \neq 0$, then the variable $X_j^{(0)}$ is in the active set (i.e., relevant) of the nonlinear true function $f^0$, see Proposition 3.
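The constant relating the projected coefficient to the partial correlation can be checked directly from the precision matrix. The short derivation below is a sketch under the standard block-inverse identities, writing $K$ for the inverse of the covariance matrix of $(Y, X)$ with entries $K_{YY}$, $K_{Yj}$, $K_{jj}$:

```latex
\begin{align*}
\beta_j^0 &= -\frac{K_{Yj}}{K_{YY}}
  && \text{(coefficients of the linear projection of $Y$ on $X$)}, \\
\mathrm{Parcorr}\big(Y, X_j \mid \{X_k;\ k \neq j\}\big) &= -\frac{K_{Yj}}{\sqrt{K_{YY}\,K_{jj}}}
  && \text{(partial correlation from the precision matrix)}, \\
\Longrightarrow\quad \beta_j^0 &= \sqrt{\frac{K_{jj}}{K_{YY}}}\;
  \mathrm{Parcorr}\big(Y, X_j \mid \{X_k;\ k \neq j\}\big).
\end{align*}
```

Both identities only involve second moments, so no Gaussian assumption is needed for this interpretation.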

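We consider here a concrete set of assumptions for Theorem 1 below. Denote by

$$\gamma_j^0 = \mathrm{argmin}_{\gamma}\, E\Big|X_j^{(0)} - \sum_{k \neq j} \gamma_k X_k^{(0)}\Big|^2, \qquad Z_j^{(0)} = X_j^{(0)} - \sum_{k \neq j} \gamma_{j,k}^0 X_k^{(0)}$$

the population regression vector and residual variables when regressing the random variable $X_j^{(0)}$ on all other variables $\{X_k^{(0)};\ k \neq j\}$. It is well known that $\gamma_j^0 = -(\Sigma^{-1})_{\cdot,j}/(\Sigma^{-1})_{jj}$, where $(\Sigma^{-1})_{\cdot,j}$ denotes the $j$th column vector of $\Sigma^{-1}$ (assuming it exists, see (A1)).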

Assumptions. 

The covariables are such that: 

(A1) $\mathrm{Cov}(X^{(0)}) = \Sigma$ has smallest eigenvalue $\Lambda_{\min}^2(\Sigma) \geq C_1 > 0$;

(A2) $\max_j \|X_j^{(0)}\|_\infty \leq C_2 < \infty$;

(A3) $\|Z_j^{(0)}\|_\infty \leq C_3 < \infty$;

(A4) We have either: 

(a) liTjlli = o{'s/nj log(p)), || 7 °||r = o ((n/log(p))~ log(p)“^/^) for 0 < r < 1 , and the 

maximal eigenvalue of X^^Xg^./n satisfies = Op(l), where X^^. denotes the 

submatrix of the design with columns corresponding to Sj = {k‘, 7 j ^ 7 ^ 0}; 

or 

(b) $s_j = |S_j| = \|\gamma_j^0\|_0^0 = \sum_{k \neq j} I\big((\Sigma^{-1})_{jk} \neq 0\big) = o(\sqrt{n}/\log(p))$.


Regarding the structure of the regression: 

(A5) The sparsity satisfies either: 

(a) ||/3°||i = o{^/n/log{p)), ||/3°||;:A(;^,^,^(S’o) = op ((n/ log (p))V log(p)- 1 / 2 ) for 0 < r < 1, 
and the maximal eigenvalue of X^^X 5 Q/n satisfies A ^a, ^(5o) = Op(l), where denotes 
the submatrix of the design with columns corresponding to Sq = {j; /3? 7 ^ 0}; 
or

(b) $s_0 = |S_0| = \|\beta^0\|_0^0 = \sum_{j=1}^p I(\beta_j^0 \neq 0) = o(\sqrt{n}/\log(p))$.

(A6) For the second moment $\omega_{p;jj} := E|\varepsilon^{(0)} Z_j^{(0)}|^2$: $\omega_{p;jj} \geq C_4$ for some constant $C_4 > 0$. (The existence of $\omega_{p;jj} < \infty$ is implied by (A3) and (A7)).

(A7) The error satisfies one of the following conditions:

(a) $|\varepsilon^{(0)}| \leq V$, where $V$ is a fixed random variable (not depending on $p$) with $E|V|^2 < \infty$;

or

(b) $E|\varepsilon^{(0)}|^{2+\delta} \leq C_5 < \infty$ for some $\delta > 0$.

Either of the conditions implies that for some constant $C_6 < \infty$, $E|\varepsilon^{(0)}|^2 \leq C_6 < \infty$.


The assumptions (A2) and (A3) are somewhat restrictive (see also (B1) in van de Geer et al. (2014)). Assumption (A3) is implied by (A2) and assuming that $\|\gamma_j^0\|_1$ is bounded. Examples where (A7) holds are discussed in Section 3.2. Regarding the assumptions (A4) and (A5) we first note that:

(A4) can be replaced by (D2) in Section 7.1.1,

(A5) can be replaced by (D3) in Section 7.1.1,

see Section 7.1 and the associated lemma. Furthermore, for sparsity in (A4,a) and (A5,a), the condition on the maximal eigenvalue can be relaxed by requiring for e.g. (A5,a) that

= {j; |/3°| > C'Vlog(p)/n/A^ax(5o)}, 

for some $0 < C < \infty$, has cardinality $s_* = o(n/\log(p))$; and analogously for condition (A4,a). Requiring some sparsity for the design as in (A4) is due to our proof of the corresponding proposition; this is in contrast to fixed design, where no sparsity condition on the design is needed when using the nodewise square root Lasso in (3) (see Theorem 2). Finally, a sparsity assumption as in (A5) is typical for the de-sparsified Lasso (Zhang and Zhang, 2014; van de Geer et al., 2014; van de Geer, 2014).


Theorem 1. Consider the de-sparsified Lasso in (4) with (2) or (3) and the parameter $\beta^0$ in (6) induced by the random design model (5). Assume (A1)-(A7). If $\lambda = D_1\sqrt{\log(p)/n}$ and $\lambda_X = D_2\sqrt{\log(p)/n}$ for $D_1, D_2$ sufficiently large, then:

$$\sqrt{n}\,\frac{Z_j^T X_j/n}{\sqrt{\omega_{p;jj}}}\,(\hat b_j - \beta_j^0) \Rightarrow \mathcal{N}(0,1) \quad (n \to \infty),$$

where $\omega_{p;jj} = E|\varepsilon^{(0)} Z_j^{(0)}|^2$.

A proof is given in a later section. The representation of the normalization factor should make it easier to recognize its order of magnitude $\sqrt{n}$. For construction of confidence intervals and hypothesis tests we need to consistently estimate the quantity $\omega_{p;jj}$: this is discussed in the following Section 3.1.

Remark 1. If the assumptions in (A3), (A4) and (A6) hold uniformly in $j$, we can rephrase the statement of Theorem 1 as follows:

$$\frac{Z_j^T X_j}{\sqrt{n\,\omega_{p;jj}}}\,(\hat b_j - \beta_j^0) = \Delta_j + W_j, \qquad \max_{j=1,\ldots,p} |\Delta_j| = o_P(1), \quad W_j \Rightarrow \mathcal{N}(0,1).$$


3.1 Estimation of the variance 

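We can estimate $\omega_{p;jj} = E|\varepsilon^{(0)} Z_j^{(0)}|^2$ by the empirical variance of $\hat\varepsilon_i Z_{j;i}$ $(i = 1, \ldots, n)$:

$$\hat\omega_{jj} = n^{-1} \sum_{i=1}^n \Big(\hat\varepsilon_i Z_{j;i} - n^{-1} \sum_{r=1}^n \hat\varepsilon_r Z_{j;r}\Big)^2, \qquad \hat\varepsilon = Y - X\hat\beta.$$

Proposition 1. Consider the random design model (5) with the projected parameter $\beta^0$ in (6). Assume (A1), (A2), (A3), $\|\beta^0\|_1 = o(\sqrt{n/\log(p)})$ (which is part of assumption (A5)), (A6), (A7) and (D2) from Section 7.1.1 (the latter is implied by the additional assumption (A4)). Then,

$$\hat\omega_{jj} = \omega_{p;jj} + o_P(1).$$

A proof is given in a later section. We have as an estimate of the normalizing factor in Theorem 1 the following expression:

$$\frac{Z_j^T X_j}{\sqrt{n\,\hat\omega_{jj}}}, \tag{7}$$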

corresponding to the “sandwich formula” in the case with $p < n$ (Eicker, 1967; Huber, 1967; White, 1980; Freedman et al., 1981).

In particular the formula in (7) is different from the usual expression for correctly specified high-dimensional linear models, used in van de Geer et al. (2014),

$$\frac{Z_j^T X_j}{\hat\sigma_\varepsilon \|Z_j\|_2}, \tag{8}$$

where $\hat\sigma_\varepsilon^2$ is an estimate of the error variance $\sigma_\varepsilon^2$, e.g., based on the residuals $\hat\varepsilon = Y - X\hat\beta$. While the formula in (8) is asymptotically valid for correctly specified models, the analogue in (7) is robust and valid irrespective of whether the model is correct or not. The expression in (7) is now also implemented in the R-software package hdi.
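To make the difference between the robust and the standard normalization concrete, the following sketch computes both factors and a resulting confidence interval from given vectors. It is an illustration under stated assumptions, not the hdi implementation: the robust factor is $Z_j^T X_j/\sqrt{n\hat\omega_{jj}}$ with $\hat\omega_{jj}$ the empirical variance of $\hat\varepsilon_i Z_{j;i}$, the standard factor is $Z_j^T X_j/(\hat\sigma_\varepsilon\|Z_j\|_2)$, and the choice $\hat\sigma_\varepsilon^2 = \|\hat\varepsilon\|_2^2/n$ as well as all function names and toy numbers are hypothetical.

```python
import math
from statistics import NormalDist


def omega_hat(resid, z):
    """Empirical variance of the products resid_i * z_i (estimate of omega_{p;jj})."""
    n = len(z)
    prod = [e * zi for e, zi in zip(resid, z)]
    mean = sum(prod) / n
    return sum((v - mean) ** 2 for v in prod) / n


def robust_normalizer(z, xj, resid):
    """Sandwich-type factor Z_j^T X_j / sqrt(n * omega_hat)."""
    n = len(z)
    return sum(a * b for a, b in zip(z, xj)) / math.sqrt(n * omega_hat(resid, z))


def standard_normalizer(z, xj, resid):
    """Factor Z_j^T X_j / (sigma_hat * ||Z_j||_2) for a correctly specified model.

    sigma_hat^2 = ||resid||_2^2 / n is an assumption made for this sketch.
    """
    n = len(z)
    sigma_hat = math.sqrt(sum(e * e for e in resid) / n)
    z_norm = math.sqrt(sum(a * a for a in z))
    return sum(a * b for a, b in zip(z, xj)) / (sigma_hat * z_norm)


def confidence_interval(b_j, normalizer, level=0.95):
    """Two-sided interval b_j +/- q / |normalizer| with q the standard normal quantile."""
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = q / abs(normalizer)
    return b_j - half, b_j + half


# Toy vectors (hypothetical): error size unrelated to |Z_j| ...
z_a, x_a, r_a = [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [1.0, -1.0, 1.0, -1.0]
# ... and error size growing with |Z_j|, as can happen under misspecification.
z_b, x_b, r_b = [1.0, -1.0, 2.0, -2.0], [1.0, -1.0, 2.0, -2.0], [1.0, 1.0, 3.0, 3.0]

same_a = (robust_normalizer(z_a, x_a, r_a), standard_normalizer(z_a, x_a, r_a))
diff_b = (robust_normalizer(z_b, x_b, r_b), standard_normalizer(z_b, x_b, r_b))
ci_a = confidence_interval(1.0, same_a[0])
```

In the first toy case both factors agree; in the second the robust factor is smaller, so the corresponding interval is wider than the one based on the standard formula.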


3.2 Sparsity of the projection and implications on the error 

The statement in Theorem 1 depends, among other conditions, on assumptions (A5)-(A7) which depend on the projection of the nonlinear to a linear model. In particular, (A5) requires sparsity of the projected parameter vector: even if the underlying true nonlinear regression function depends only on a few covariables, the projected parameter $\beta^0$ in (6) is not necessarily sparse. We provide here some sufficient conditions ensuring a sparse $\beta^0$.

Throughout this subsection, $\beta^0$ is as in (6). We know that

$$\beta^0 = \Sigma^{-1}\Gamma, \qquad \Sigma = \mathrm{Cov}(X^{(0)}), \quad \Gamma = \big(\mathrm{Cov}(f^0(X^{(0)}), X_1^{(0)}), \ldots, \mathrm{Cov}(f^0(X^{(0)}), X_p^{(0)})\big)^T.$$

Therefore,

$$\beta^0 = \sum_{\ell=1}^p (\Sigma^{-1})_{\cdot,\ell}\, \Gamma_\ell. \tag{9}$$

Denote by $\|\Sigma^{-1}\|_\infty = \max_{j,k} |(\Sigma^{-1})_{jk}|$ and by $(\Sigma^{-1})_{\cdot,\ell}$ the $\ell$th column of $\Sigma^{-1}$, and generally by $\|u\|_0^0 = \sum_{r=1}^d I(u_r \neq 0)$ the $\ell_0$-sparsity of a $d$-dimensional vector $u$.

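Proposition 2. Consider the random design model (5) with the projected parameter $\beta^0$ in (6). Assume that $\Sigma$ is positive definite (but not requiring bounds on its eigenvalues). The following holds:

1. $\ell_r$-sparsity for $0 < r < 1$:

$$\|\beta^0\|_r \leq \max_\ell \|(\Sigma^{-1})_{\cdot,\ell}\|_r\, \|\Gamma\|_r,$$

which implies, for $s_\ell = \|\gamma_\ell^0\|_0^0 = \sum_{k \neq \ell} I\big((\Sigma^{-1})_{k\ell} \neq 0\big)$,

$$\|\beta^0\|_r \leq (\max_\ell s_\ell + 1)^{1/r}\, \|\Sigma^{-1}\|_\infty\, \|\Gamma\|_r.$$

2. $\ell_0$-sparsity:

$$\|\beta^0\|_0^0 \leq \sum_{\ell \in S_\Gamma} (s_\ell + 1), \qquad S_\Gamma = \{j;\ \Gamma_j \neq 0\},$$

which implies

$$\|\beta^0\|_0^0 \leq (\max_\ell s_\ell + 1)\, \|\Gamma\|_0^0.$$

A proof is given in a later section. As an example, consider the case where $\Sigma$ is block-diagonal with maximal block-size equal to $b_{\max}$. We then have that $\max_\ell s_\ell + 1 = b_{\max}$ and hence by Proposition 2:

$$\|\beta^0\|_r \leq b_{\max}^{1/r}\, \|\Sigma^{-1}\|_\infty\, \|\Gamma\|_r \quad (0 < r < 1),$$

$$\|\beta^0\|_0^0 \leq b_{\max}\, \|\Gamma\|_0^0.$$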
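The $\ell_0$ bound for a block-diagonal covariance can be checked numerically on a tiny example. The sketch below is illustrative only and assumes a hypothetical $4 \times 4$ covariance with two blocks of size two and a covariance vector $\Gamma$ with a single nonzero entry; it solves $\Sigma\beta = \Gamma$ and compares the number of nonzero coefficients with the bound $b_{\max}\|\Gamma\|_0^0$.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def count_nonzero(vec, tol=1e-12):
    """Number of entries with absolute value above a small tolerance."""
    return sum(1 for v in vec if abs(v) > tol)


# Hypothetical block-diagonal covariance: blocks {1, 2} and {3, 4}, so b_max = 2.
sigma = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
gamma_vec = [1.0, 0.0, 0.0, 0.0]   # f^0 is correlated with the first variable only
b_max = 2

beta_proj = solve(sigma, gamma_vec)           # beta^0 = Sigma^{-1} Gamma
nnz_beta = count_nonzero(beta_proj)
nnz_gamma = count_nonzero(gamma_vec)
```

Although $\Gamma$ has a single nonzero entry, the projected parameter picks up the second variable of the same block, which is exactly the effect the factor $b_{\max}$ accounts for.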
Block dependence. Assume now that the predictor variables exhibit block dependence with blocks corresponding to the associated block-diagonal covariance matrix $\Sigma$. That is, there are blocks of variables, where the variables from different blocks are (jointly) independent, and these blocks induce a block-diagonal covariance matrix. Denote by $S_{f^0} \subseteq \{1, \ldots, p\}$ the support of $f^0(\cdot)$ which contains all the variables which have an influence in $f^0(\cdot)$.

Corollary 1. Assume the conditions of Proposition 2. In addition, assume block dependence with maximal block-size equal to $b_{\max}$. We have that

$$\|\Gamma\|_0^0 \leq b_{\max}\, |S_{f^0}|,$$

and, due to Proposition 2,

$$\|\beta^0\|_0^0 \leq b_{\max}^2\, |S_{f^0}|.$$

A proof is given in a later section.

Proposition 2 and Corollary 1 obviously lead to justifications of the assumption on the sparsity $s_0$ in (A5), but also for the conditions in (A7). Regarding the latter: if $\|\beta^0\|_1 \leq C_9 < \infty$ (which is implied by $\|\beta^0\|_0^0$ bounded and $\max_j |\beta_j^0|$ bounded) and assuming (A2) we have that

$$|\varepsilon^{(0)}| = |Y^{(0)} - (X^{(0)})^T \beta^0| \leq |Y^{(0)}| + C_9 C_2.$$

Thus, assuming either $|Y^{(0)}| \leq V$ for some fixed random variable $V$ with $E|V|^2 < \infty$ or $E|Y^{(0)}|^{2+\delta} \leq M_3 < \infty$ (which are both rather weak assumptions) implies either (A7,a) or (A7,b), respectively.

3.3 Gaussian design 

The bound in Proposition 2 and Corollary 1 for $\ell_0$-sparsity can be much improved when assuming that $X^{(0)}$ has a joint Gaussian distribution. This is in conflict with assumption (A2). However, for the case with Gaussian design, thereby dropping (A2) and (A3), it would be easier to derive the statements from Theorem 1 and Proposition 1.

Proposition 3. Consider the random design model (5) with the projected parameter $\beta^0$ in (6). Assume that $X^{(0)}$ has a joint Gaussian distribution with positive definite covariance matrix $\Sigma$ (but not requiring bounds on its eigenvalues). Then,

$$S_0 \subseteq S_{f^0}.$$

A proof is given in a later section. This is an important result saying that if we infer a variable as an active variable (significantly different from zero) in the misspecified linear model, it must be an active variable in the nonlinear true model.

To make further statements, we represent the function $f^0$ as follows:

$$f^0(x) = \sum_{k=1}^d f_k^0(x_{S_k}), \qquad \{S_1, \ldots, S_d\} \text{ a partition: } S_{f^0} = \cup_{k=1}^d S_k,\ S_k \cap S_\ell = \emptyset\ (k \neq \ell),$$

where $x_A$ denotes the subvector of $x$ with components in $A \subseteq \{1, \ldots, p\}$ and $E[f_k^0(X_{S_k}^{(0)})] = 0$; and the partition is finest in the sense that the representation of $f^0$ is given with the $S_k$'s of smallest possible cardinality. For example, for the function considered in Section 5,

$$f^0(x) = -5 + 5\sin(\pi x_1 x_2) + 4(x_3 - 0.5)^2 + 2x_5 + x_6, \tag{10}$$

we have the partition $S_1 = \{1, 2\}$, $S_2 = \{3\}$, $S_3 = \{5\}$, $S_4 = \{6\}$.

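Proposition 4. Consider the random design model (5) with the projected parameter $\beta^0$ in (6). Assume that $X^{(0)}$ has a joint Gaussian distribution with positive definite covariance matrix $\Sigma$ (but not requiring bounds on its eigenvalues). Consider the projected parameter in the submodel with variables from $S_k$ $(k \in \{1, \ldots, d\})$:

$$\beta(S_k) = \mathrm{argmin}_{\beta \in \mathbb{R}^{|S_k|}}\, E|f_k^0(X_{S_k}^{(0)}) - (X_{S_k}^{(0)})^T \beta|^2.$$

For $j \in S_k$ we denote by $c(j)$ the index of the component in $\beta(S_k)$ which corresponds to variable $X_j^{(0)}$. Then,

$$\beta_j^0 = \beta_{c(j)}(S_k),$$

saying that we can infer $\beta_j^0$ with $j \in S_k$ from the submodel with variables $X_{S_k}^{(0)}$.

A proof is given in the Appendix. As an example, we consider again $f^0$ from (10). Proposition 4 then implies:

$$(\beta_1^0, \beta_2^0)^T = \mathrm{argmin}_{\beta \in \mathbb{R}^2}\, E|5\sin(\pi X_1^{(0)} X_2^{(0)}) - (X_1^{(0)}, X_2^{(0)})\beta|^2 = (0, 0)^T,$$

$$\beta_3^0 = \mathrm{argmin}_{\beta \in \mathbb{R}}\, E|4(X_3^{(0)} - 0.5)^2 - 5 - X_3^{(0)}\beta|^2 = -4,$$

$$\beta_5^0 = \mathrm{argmin}_{\beta \in \mathbb{R}}\, E|2X_5^{(0)} - X_5^{(0)}\beta|^2 = 2,$$

$$\beta_6^0 = \mathrm{argmin}_{\beta \in \mathbb{R}}\, E|X_6^{(0)} - X_6^{(0)}\beta|^2 = 1,$$

and all $\beta_j^0 = 0$ for $j \notin S_{f^0}$. For the numerical values of $\beta_1^0$, $\beta_2^0$ and $\beta_3^0$, we used that $X^{(0)}$ has mean zero.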
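The one-dimensional projections in this example can be verified numerically. The sketch below is an illustration under the assumption that the relevant covariable is standard normal (mean zero, variance one); it approximates Gaussian expectations with the trapezoidal rule and evaluates $E[f(X)X]/E[X^2]$ for the additive components. The helper names are hypothetical.

```python
import math


def gauss_expect(g, lo=-10.0, hi=10.0, steps=4000):
    """E[g(X)] for X ~ N(0, 1), approximated by the trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * g(x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)


def projection_slope(f):
    """argmin_b E|f(X) - X b|^2 = E[f(X) X] / E[X^2] for a single N(0, 1) covariable."""
    return gauss_expect(lambda x: f(x) * x) / gauss_expect(lambda x: x * x)


beta3 = projection_slope(lambda x: 4.0 * (x - 0.5) ** 2 - 5.0)  # quadratic component
beta5 = projection_slope(lambda x: 2.0 * x)                     # linear component 2 * x_5
beta6 = projection_slope(lambda x: x)                           # linear component x_6
centering = gauss_expect(lambda x: 4.0 * (x - 0.5) ** 2 - 5.0)  # mean of the centered term
```

The quadratic component contributes a negative linear slope because $(x - 0.5)^2 = x^2 - x + 0.25$ contains the term $-x$; the intercept $-5$ only centers the function and does not affect the slope.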


4 Fixed design model 

Consider the model as in (5) but now with fixed design:

$$Y^{(i)} = f^0(X^{(i)}) + \xi^{(i)}, \quad i = 1, \ldots, n, \tag{11}$$

where $\xi^{(1)}, \ldots, \xi^{(n)}$ are i.i.d. with $E[\xi^{(i)}] = 0$ and $E|\xi^{(i)}|^2 = \sigma_\xi^2$. As before, we denote the $n \times p$ design matrix by $X$ and the $n \times 1$ response vector by $Y = (Y^{(1)}, \ldots, Y^{(n)})^T$. We assume that $\mathrm{rank}(X) = n < p$ and thus, we can always represent the vector $f^0 = (f^0(X^{(1)}), \ldots, f^0(X^{(n)}))^T$ as $X\beta^0$. The vector $\beta^0$ is not unique, but we can look for some sparsest solution. We consider the basis pursuit solution (Chen et al., 1998), known also as the solution from compressed sensing (Candès and Tao, 2006; Donoho, 2006):

$$\beta^0 = \mathrm{argmin}_{\beta}\, \{\|\beta\|_1;\ X\beta = f^0\}. \tag{12}$$
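The basis pursuit problem can be written as a linear program, which is what makes its solution computable in practice. The reformulation below is the standard one, obtained by splitting $\beta$ into its positive and negative parts $u$ and $v$; these auxiliary symbols are introduced here for illustration only.

```latex
\begin{align*}
\min_{u, v \in \mathbb{R}^p} \quad & \sum_{k=1}^p (u_k + v_k) \\
\text{subject to} \quad & X(u - v) = f^0, \qquad u \geq 0, \quad v \geq 0,
\end{align*}
% at an optimum u_k v_k = 0 for all k, so that
% \beta = u - v satisfies \|\beta\|_1 = \sum_k (u_k + v_k) and X\beta = f^0.
```

Any linear programming solver can then be applied to the $2p$ nonnegative variables with $n$ equality constraints.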
Thus, the model in (11) is correctly specified as a linear model

$$Y = X\beta^0 + \varepsilon \quad \text{with } \beta^0 \text{ as in (12)}, \tag{13}$$

where $\varepsilon = (\xi^{(1)}, \ldots, \xi^{(n)})^T$. In particular, due to correct specification, the interpretation of $\beta^0$ is standard.

We refer to this $\beta^0$ in (12) throughout this section (unless stated otherwise). We assume the following:

(B1) $\lambda_X \asymp \sqrt{\log(p)/n}$ and $\|Z_j\|_2^2/n \geq C > 0$;

(B2) $\|\hat\beta(\lambda) - \beta^0\|_1 = o_P(1/\sqrt{\log(p)})$.

We justify these assumptions below.

Theorem 2. Consider the de-sparsified Lasso in (4) with (2) or (3) and the fixed design model (11) with $\mathrm{rank}(X) = n$ and linear representation as in (13) with $\beta^0$ as in (12). Assume either Gaussian errors or condition (A7) and assume that $\sigma_\varepsilon^2 \geq L > 0$. Suppose that (B1) and (B2) hold when using the nodewise Lasso (2) or only (B2) when using the nodewise square root Lasso (3). Then

$$\frac{Z_j^T X_j}{\sigma_\varepsilon \|Z_j\|_2}\,(\hat b_j - \beta_j^0) \Rightarrow \mathcal{N}(0,1).$$

Proof: This follows from van de Geer et al. (2014, Th.2.1) for Gaussian errors. For non-Gaussian errors, we invoke the Lindeberg condition and proceed as for the proof of Theorem 1 (and the associated proposition). □

We argue first that (B1) holds with high probability. Consider the setting where the rows of $X$ arise as fixed i.i.d. realizations of a $p$-dimensional random variable with covariance matrix $\Sigma$. Assume the following.

(C1) (i) $0 < C_7 \leq 1/(\Sigma^{-1})_{jj} = E|Z_j^{(0)}|^2 \leq C_8 < \infty$ (the upper bound is implied by (A3); the lower bound is the analogue of (A6));

(ii) $\max_j \|X_j\|_\infty \leq C_2 < \infty$ (which is assumption (A2));

(iii) $\|\gamma_j^0\|_1 = o(\sqrt{n/\log(p)})$ (which is part of the assumption (A4,a)).

(C2) (A1), (A2), (A5) and (A7).

Proposition 5. (for nodewise Lasso only) Assume that (C1) holds. Then, for $\lambda_X = D_2\sqrt{\log(p)/n}$ with $D_2$ sufficiently large, assumption (B1) holds with probability tending to one.

A proof is given in a later section.

Proposition 6. Consider the fixed design model (11) having a linear representation as in (13) with $\beta^0$ as in (12). Assume that (C2) holds. Then, for $\lambda = D_1\sqrt{\log(p)/n}$ with $D_1$ sufficiently large, assumption (B2) holds with probability tending to one.

Proof. The statement can be derived as in the proof of statement 2 in the lemma used for the random design case. □




















Sparse solutions and misspecification. We note that for a fixed design linear model, misspecification with respect to the linearity in the unknown parameters cannot happen. The same is true when conditioning on the covariables $X$. In this scenario, we do not need to employ the “sandwich” variance formula in (7) but we can use the more standard expression from (8). What is important though is the interpretation of the parameter and of the output of the de-sparsified Lasso: the inferential statements are valid for a sparse approximation. We focused here on the choice of the basis pursuit solution in (12) which is perhaps among the simplest and which can be computed. But in fact, any solution of $X\beta = f^0$ satisfying assumption (B2) is good enough: or in view of Proposition 6, any solution which is weak $\ell_r$- $(0 < r < 1)$ or $\ell_0$-sparse, see (A5), is fine. A confidence interval then means that it covers any sufficiently $\ell_r$- and $\ell_0$-sparse solution $\beta^0$ of $X\beta = f^0$. This itself is a nice and “strong” interpretation of a confidence interval, namely that despite non-uniqueness, it covers all sparse solutions.


5 Some empirical results 


We consider two non-linear models as in (5) (or versions thereof for fixed design, see Section 5.2). The first one uses a nonlinear regression function from Friedman’s (1991) MARS paper but with smaller signal to noise ratio:

(M1)
$X^{(0)} \sim \mathcal{N}_p(0, \Sigma)$, $\Sigma_{jj} = 1\ \forall j$, $\Sigma_{3,4} = \Sigma_{4,3} = 0.8$, $\Sigma_{jk} = 0$ $(j \neq k;\ \{j,k\} \neq \{3,4\})$,
$f^0(x) = -5 + 2\sin(\pi x_1 x_2) + 4(x_3 - 0.5)^2 + 2x_5 + x_6$,
$\xi^{(0)} \sim \mathcal{N}(0,1)$.

(M2)
$X^{(0)}$ as in (M1),
$f^0(x) = \sin(\pi/2\, x_1)\, x_2 + x_3^3/5 + x_5 + x_6/2$,
$\xi^{(0)} \sim \mathcal{N}(0,1)$.


(M3) 


~ A'p(0,E), Ej-fc = /° as in (Ml). 


(M4)
$X^{(0)}$ as in (M3), $f^0$ as in (M2).

The intercept $-5$ in the function in (M1) and (M3) ensures that $E[f^0(X^{(0)})] = 0$.


5.1 Simulations for random design 

For random design, the corresponding parameters in (6) are as follows:

for model (M1),(M3): $\beta^0 = (0, 0, -4, 0, 2, 1, 0, \ldots, 0)^T$,
for model (M2),(M4): $\beta^0 = (0, 0, 0.6, 0, 1, 0.5, 0, \ldots, 0)^T$.

The values are in accordance with Proposition 3 because of Gaussianity of the design: the active set $S_0 = \{3, 5, 6\} \subseteq S_{f^0} = \{1, 2, 3, 5, 6\}$. Figure 1 displays $\|\beta^0\|_r^r$ as a function of $r$ for $0 < r < 1$. The log-sparsity is approximately a linear function in $r$, once increasing (for (M1),(M3)) and once decaying (for (M2),(M4)). Our theory requires either weak $\ell_r$-sparsity or $\ell_0$-sparsity of $\beta^0$ (see (A5,a) or (A5,b)) and hence a possibly more realistic assumption than $\ell_0$-sparsity alone.

[Figure 1 panels: (M1,M3), random; (M2,M4), random. Horizontal axis: r.]

Figure 1: Random design for models (M1),(M3) and (M2),(M4) with p = 1000. Plot of $\|\beta^0\|_r^r$ (on log-scale) as a function of $r \in [0,1]$ ($r = 0$ corresponds to the $\ell_0$-sparsity), where $\beta^0$ is as in (6).
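The behaviour of the $\ell_r$-sparsity for the two projected parameter vectors can be reproduced directly from their nonzero entries. The short sketch below assumes the values $(-4, 2, 1)$ and $(0.6, 1, 0.5)$ for the active coefficients and evaluates $\|\beta\|_r^r = \sum_j |\beta_j|^r$ on a grid; the function name and the grid are illustrative choices.

```python
def lr_sparsity(beta, r):
    """||beta||_r^r = sum_j |beta_j|^r; for r = 0 this counts the nonzero entries."""
    if r == 0:
        return sum(1 for b in beta if b != 0.0)
    return sum(abs(b) ** r for b in beta if b != 0.0)


# First six entries of the projected parameters; the remaining entries are zero.
beta_m1 = [0.0, 0.0, -4.0, 0.0, 2.0, 1.0]   # models (M1), (M3)
beta_m2 = [0.0, 0.0, 0.6, 0.0, 1.0, 0.5]    # models (M2), (M4)

grid = [i / 10 for i in range(11)]          # r = 0, 0.1, ..., 1
curve_m1 = [lr_sparsity(beta_m1, r) for r in grid]
curve_m2 = [lr_sparsity(beta_m2, r) for r in grid]
```

Both curves start at the $\ell_0$-sparsity 3 for $r = 0$; the first increases towards $\|\beta\|_1 = 7$ because its nonzero coefficients are at least one in absolute value, while the second decreases towards $\|\beta\|_1 = 2.1$ because its coefficients are at most one.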

For simulations with random design, we generate $n$ independent data points according to the models (M1)-(M4) where for each realization, we generate the $X$ and $\xi$ variables anew. We consider the case with sample size $n = 200$ and dimension $p = 1000$. We use the de-sparsified Lasso procedure as described in (4) with the nodewise Lasso (2) and tuning parameters $\lambda$ and $\lambda_X$ (the same for all $j$) from the default in the R-software package hdi (Meier et al., 2014). For estimation of the asymptotic variance we use (7).

Table 1 and Figure 2 report empirical results based on 100 independent simulations. Denoting by $\mathrm{CI}_j$ a confidence interval for $\beta_j^0$, the average coverage is

$$\mathrm{avgcov}(S_0) = |S_0|^{-1} \sum_{j \in S_0} P[\beta_j^0 \in \mathrm{CI}_j], \qquad \mathrm{avgcov}(S_0^c) = |S_0^c|^{-1} \sum_{j \in S_0^c} P[\beta_j^0 \in \mathrm{CI}_j], \tag{14}$$

and the empirical analogue by replacing the probability “P” by an empirical average over the 100 simulations. We consider the average expected length of the confidence intervals

$$\mathrm{avglen}(S_0) = |S_0|^{-1} \sum_{j \in S_0} E[\mathrm{length}(\mathrm{CI}_j)], \qquad \mathrm{avglen}(S_0^c) = |S_0^c|^{-1} \sum_{j \in S_0^c} E[\mathrm{length}(\mathrm{CI}_j)], \tag{15}$$

and the empirical analogue by replacing the expectation “E” with an empirical average. The actual


model   avg. coverage S_0   avg. coverage S_0^c   avg. length S_0   avg. length S_0^c
(M1)    0.98                0.99                  3.01              2.19
(M2)    0.91                0.95                  0.48              0.41
(M3)    0.98                0.99                  4.18              3.56
(M4)    0.95                0.95                  0.70              0.65

Table 1: Random design. Average coverage and average length of confidence intervals (empirical 
versions of (14) and ([l^), for 5o and S'q separately (note that 5 q = 0 for (M3) and (M4)). Nominal 
level equal to 0.95. Sample size n = 200 and dimension p = 1000. 


[Figure 2 panels: (M1), random; (M2), random; (M3), random; (M4), random. Horizontal axis: active beta0.]

Figure 2: Random design. Coverage as a function of the coefficients $\beta_j^0$ of the active variables with $j \in S_0$. Nominal level equal to 0.95. Sample size n = 200 and dimension p = 1000.

coverage results in Table 1 and the more detailed view given in Figure 2 are very satisfactory. We note that the lengths of the confidence intervals are not constant for the same covariance model for $X$. The reason is that at least asymptotically (see Theorem 1), the length depends, among other things, on $\omega_{p;jj} = E|\varepsilon^{(0)} Z_j^{(0)}|^2$, and the error term $\varepsilon^{(0)}$ itself depends on the true function $f^0$. This is in contrast to fixed design, where the asymptotic length of the confidence intervals is a function of $X$ and $\sigma_\varepsilon^2 = \sigma_\xi^2 = E|\varepsilon_i|^2$ only (see Theorem 2 and formula (17) and (19)).
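The empirical versions of the average coverage and average length are simple averages over simulation runs and over the index set. The sketch below shows one way to compute them, assuming the intervals are stored per simulation run as a list of (lower, upper) pairs indexed by variable; the data layout, function names and toy numbers are hypothetical.

```python
def average_coverage(intervals, beta0, index_set):
    """Fraction of (run, j) pairs with j in index_set whose interval contains beta0[j]."""
    hits = sum(1 for run in intervals for j in index_set
               if run[j][0] <= beta0[j] <= run[j][1])
    return hits / (len(intervals) * len(index_set))


def average_length(intervals, index_set):
    """Mean interval length over all runs and all j in index_set."""
    total = sum(run[j][1] - run[j][0] for run in intervals for j in index_set)
    return total / (len(intervals) * len(index_set))


# Toy example with two variables and two simulation runs.
beta0_toy = [1.0, 0.0]
intervals_toy = [
    [(0.5, 1.5), (-0.2, 0.2)],   # run 1: both intervals cover
    [(1.1, 2.1), (-0.3, 0.1)],   # run 2: the interval for variable 0 misses
]
active, inactive = [0], [1]

cov_active = average_coverage(intervals_toy, beta0_toy, active)
cov_inactive = average_coverage(intervals_toy, beta0_toy, inactive)
len_active = average_length(intervals_toy, active)
len_inactive = average_length(intervals_toy, inactive)
```

Reporting the active and the inactive set separately, as in the tables, avoids that the many zero coefficients dominate the averages.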


5.2 Simulations for fixed design 


We consider the same models (M1)-(M4) but now with fixed design with $n = 200$ and $p = 1000$, where we use a fixed realization of the $X$ variables in the corresponding model. We generate $n$ independent data points according to the models (M1)-(M4) where for each realization, we generate only the $\xi$ error variables anew.

We note that for all the four models with fixed design we have that $|S_0| = n = 200$. Figure 3 displays $\|\beta^0\|_r^r$ as a function of $r$ for $0 < r < 1$, where $\beta^0$ is the basis pursuit solution from (12) and the parameter of interest, for 100 different independent simulation runs. The log-sparsity is approximately a linear decreasing function in $r$. Even more pronounced here for fixed than random design, we conclude that weak $\ell_r$-sparsity, as required by our theory, seems to be a much more realistic assumption than $\ell_0$-sparsity which is always equal to $n = 200$. However, we also see that for model (M3), the parameter $\beta^0$ is not very $\ell_r$-sparse. Thus, it might be difficult for a confidence interval to achieve good coverage, see also Figure 4 and the last paragraph of this section.

Figure 3: Fixed design for models (M3) and (M4) with n = 200 and p = 1000. 100 independent realizations and corresponding basis pursuit solutions $\beta^0$ as in (12); the lines correspond to the 100 different values of $\|\beta^0\|_r^r$ (on log-scale) as a function of $r \in [0,1]$ ($r = 0$ corresponds to the $\ell_0$-sparsity).

We use the de-sparsified Lasso procedure as described in (4) with the nodewise Lasso (2) and tuning parameters $\lambda$ and $\lambda_X$ (the same for all $j$) from the default in the R-software package hdi (Meier et al., 2014). For estimation of the asymptotic variance we use (8). Table 2 and Figure 4 report empirical results for the basis pursuit solution $\beta^0$ in (12), based on 100 independent simulations where the design is a fixed realization from the models (M1)-(M4). The actual


model   avg. coverage S_0   avg. coverage S_0^c   avg. length S_0   avg. length S_0^c
(M1)    0.97                0.98                  1.68              1.69
(M2)    0.95                0.97                  0.41              0.41
(M3)    0.96                0.97                  3.26              3.27
(M4)    0.96                0.96                  0.95              0.95

Table 2: Fixed design. Average coverage and average length of confidence intervals (empirical versions of (14) and (15)) for the basis pursuit solution $\beta^0$ in (12), for $S_0$ and $S_0^c$ separately. Nominal level equal to 0.95. Sample size n = 200 and dimension p = 1000.


[Figure 4 panels: (M1), fixed; (M2), fixed; (M3), fixed; (M4), fixed. Horizontal axis: active beta0.]

Figure 4: Fixed design. Coverage as a function of the coefficients $\beta_j^0$ (from basis pursuit in (12)) of the active variables with $j \in S_0$. Nominal level equal to 0.95. Sample size n = 200 and dimension p = 1000.

average coverage results in Tableare very fine. However, with the more detailed view in Figure]^ 
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the coverage can be quite poor for a few coefficients although this should be interpreted cautiously, 
as explained below. The poor coverage is particularly visible for the models (Ml) and (M3): a 
reason might be that the degree of weak ^^-sparsity of the basis pursuit solution (5^ in (12) is not 
as high as for (M2) and (M4) ((shown for (M3), (M4) in Figure]^. Regarding the lengths of the 
confidence intervals: we cannot confirm the asymptotic behavior saying that they are equal for 
the same covariance model for the realized X and the same error variances (e.g. (Ml) and (M2)), 
regardless of the true underlying nonlinear regression function. 

It is important to interpret the obtained confidence intervals as described in the last paragraph of Section ??: any solution of $X\beta = f^0$ which is weakly $\ell_r$-sparse ($0 < r < 1$) or $\ell_0$-sparse is fine and should be covered by the confidence interval. Our findings in Figure 4 are for the basis pursuit solution only, and the latter is not very sparse (see Figure ??). This does not imply, though, that there is no other solution $\beta^0$ which is $\ell_r$- or $\ell_0$-sparse and whose components would be covered well by the obtained confidence intervals. Unfortunately, the latter statement cannot be checked because of the computational complexity involved, in contrast to the findings for the basis pursuit solution, which can be easily computed with a linear program. Therefore, the somewhat negative findings indicated in Figure 4 should be down-weighted.
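The remark that the basis pursuit solution can be computed with a linear program can be illustrated with a minimal sketch. It assumes SciPy's `linprog` is available; the toy sizes, the regression function and the helper name `basis_pursuit` are hypothetical choices for illustration and are not the simulation settings of this paper.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(X, f):
    """Solve min ||beta||_1 subject to X beta = f as a linear program.

    Split beta = u - v with u, v >= 0; at the optimum sum(u) + sum(v) = ||beta||_1.
    """
    n, p = X.shape
    cost = np.ones(2 * p)                      # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])                  # equality constraint: X u - X v = f
    res = linprog(cost, A_eq=A_eq, b_eq=f, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]


rng = np.random.default_rng(0)
n, p = 20, 60                                  # toy sizes (assumption), much smaller than n = 200, p = 1000
X = rng.standard_normal((n, p))
f = np.sin(X[:, 0]) + X[:, 1] ** 2             # hypothetical nonlinear regression function at the design points
beta_bp = basis_pursuit(X, f)
residual = np.max(np.abs(X @ beta_bp - f))     # how well X beta = f is satisfied
```

Since $p > n$ and X has full row rank here, an exact solution of $X\beta = f$ exists even though f is nonlinear in the covariates; the linear program picks one with minimal $\ell_1$-norm.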


6 Discussion 


The current work offers a precise description of interpretation and (sufficient) assumptions for inference in a misspecified high-dimensional linear model. Table 3 summarizes the main points with respect to interpretation and modification of the de-sparsified Lasso procedure. A modification of the variance as in (??) is needed for the case of a random design misspecified model. Such a modification seems always advisable for the random design case, as it is consistent irrespective of whether the model is correct or not and hence offers some robustness against model misspecification; see for example Huber (1967). The conceptual parts, as indicated in Table 3, will not change for generalized linear models, as one can link them to weighted linear regression. One should decide beforehand whether the inference should be performed with fixed X (or conditional on X) or whether X is considered as random. The interpretation of the parameter (see Table 3) changes when the true underlying regression function is nonlinear, perhaps more dramatically than expected. For the special case of Gaussian random design we have the interesting property that $S_0 \subseteq S_{f^0}$ (Proposition ??), saying that if a variable is significant in the misspecified linear model, it must be relevant in the true nonlinear model.
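The active set property for Gaussian design can be checked numerically in a small example. The sketch below is an illustration under assumed settings (the AR(1)-type covariance, the cubic function and all variable names are hypothetical): the nonlinear truth depends on the first variable only, and the least squares projection onto all correlated variables is approximately zero outside that variable.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200_000, 5

# Assumed covariance for this sketch: Sigma[j, k] = 0.5 ** |j - k|.
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Nonlinear truth depending on the first variable only, so S_{f^0} = {1}.
f0 = X[:, 0] ** 3

# Empirical version of the projection: least squares of f^0(X) on X.
beta_proj, *_ = np.linalg.lstsq(X, f0, rcond=None)
```

With a large sample, `beta_proj` should be close to $(3, 0, \ldots, 0)$: although the other variables are correlated with the first one, their projected coefficients vanish, in line with $S_0 \subseteq S_{f^0}$.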


6.1 Sample splitting methods 

Regarding other methods for construction of p-values and confidence intervals, we briefly discuss sample splitting techniques. Such procedures, including the preferred multiple sample instead of single sample splitting (Meinshausen et al., 2009), can be used for the random design misspecified case. The reason is that the sample splitting device implicitly assumes the same probability distribution in split samples, and this holds for random X (but typically not for fixed X) and implies the same projected parameter $\beta^0$ in (??) in split samples. If the linear model is correct with the same sparse true $\beta^0$ for every sample point, sample splitting can also be used for fixed design cases (because both split samples are from a fixed design linear model with parameter vector $\beta^0$). However, for the fixed design model as in (13), the issue is different since e.g. the basis pursuit solution $\beta^0$ in (12) would be different for every split sample.

design          interpretation of $\beta^0$                          modification
random design   via projection in (??);                              modified variance in (??)
                with model-free interp. described after (??);
                for Gaussian des.: active set property (Prop. ??)
fixed design    any sparse solution of $X\beta = f^0$                no modification
                (e.g. basis pursuit solution in (12));
                with standard interp. (since no misspecif.)

Table 3: Conceptual summary of interpretation and required modification of the de-sparsified Lasso procedure for misspecified high-dimensional linear model. The required assumptions for asymptotic validity of the method are described in Theorems ?? and ??. In case of fixed design where the true underlying regression function is linear with a corresponding sparsest "true" parameter vector $\beta^0$, the basis pursuit solution typically coincides with $\beta^0$ (see the compressed sensing literature, cf. Candès and Tao, 2007).

A modification is necessary though for the misspecified random design case: even for low-dimensional inference, which is what is used after screening for variables in the first half of the sample, one has to use a modified estimator for the variance, analogously to the estimator in (??), which is robust against model misspecification.
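A minimal sketch of such a modified variance estimator for low-dimensional least squares is given below. It is an illustration under assumptions: the screened variables are taken as given, the data-generating function and the helper name `ols_coef_with_ses` are hypothetical, and the robust estimator is the empirical variance of $\hat\varepsilon_i Z_{j;i}$, which in this setting is the familiar sandwich-type estimator in the spirit of Huber (1967).

```python
import numpy as np


def ols_coef_with_ses(X, y, j):
    """OLS estimate of coefficient j with a robust and a classical standard error."""
    n, q = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    eps_hat = y - X @ beta_hat                       # OLS residuals

    # Residual of column j after regressing it on the remaining columns.
    X_rest = np.delete(X, j, axis=1)
    gamma_hat, *_ = np.linalg.lstsq(X_rest, X[:, j], rcond=None)
    z_j = X[:, j] - X_rest @ gamma_hat

    w = eps_hat * z_j
    omega_hat = np.mean(w ** 2) - np.mean(w) ** 2    # empirical variance of eps_hat_i * z_{j;i}
    scale = z_j @ X[:, j] / n
    se_robust = np.sqrt(omega_hat / n) / abs(scale)

    sigma2_hat = eps_hat @ eps_hat / (n - q)         # classical homoscedastic error variance
    se_classical = np.sqrt(sigma2_hat / (z_j @ z_j))
    return beta_hat[j], se_robust, se_classical


rng = np.random.default_rng(3)
n = 20_000
X_sel = rng.standard_normal((n, 2))                  # stands in for the variables kept after screening
y = X_sel[:, 0] ** 3 + rng.standard_normal(n)        # nonlinear truth: the linear model is misspecified
coef, se_rob, se_cls = ols_coef_with_ses(X_sel, y, j=0)
```

In this toy example the projected coefficient of the first variable equals 3, and the classical standard error is noticeably too small compared with the robust one, which is why the modification matters under misspecification.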


7 Proofs 

7.1 Proof of Theorem for random design 

We prove here the statement of Theorem ?? under slightly weaker assumptions than in condition (A). In this section, X is always random and the parameter $\beta^0$ is as in (??).

7.1.1 Preliminary results 

We show here that the following conditions hold: 

(D1) $\max_{k \neq j} |\varepsilon^T X_k/n| = O_P(\sqrt{\log(p)/n})$.

(D2) For either the nodewise Lasso in (??) or the square root Lasso in (??): $\|\hat\gamma_j(\lambda_X) - \gamma_j^0\|_1 = o_P(1/\sqrt{\log(p)})$.

(D3) $\|\hat\beta(\lambda) - \beta^0\|_1 = o_P(1/\sqrt{\log(p)})$.

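Lemma 1. For random X, assume (A2) and $E|\varepsilon^{(0)}|^2 \le C < \infty$ for some constant $C > 0$ (the latter is implied by (A7)). Then, (D1) holds, that is:

$$\max_{k \neq j} |\varepsilon^T X_k/n| = O_P(\sqrt{\log(p)/n}).$$

Proof: Using Nemirovski's inequality (Bühlmann and van de Geer, 2011, Lemma 14.24) we obtain:

$$E\Big[\max_{1 \le j \le p} |n^{-1}\varepsilon^T X_j|^2\Big] \le 8 \log(2p)\, C'/n = O(\log(p)/n),$$

where the constant $C' < \infty$ combines the bound on the covariates from (A2) with the moment bound $C$. Thus, since $E[\varepsilon_i X_{j;i}] = 0$ and using Markov's inequality:

$$P\Big[\max_{1 \le j \le p} |n^{-1}\varepsilon^T X_j| > c\Big] \le E\Big[\max_{1 \le j \le p} |n^{-1}\varepsilon^T X_j|\Big]/c \le \sqrt{E\Big[\max_{1 \le j \le p} |n^{-1}\varepsilon^T X_j|^2\Big]}\Big/c = O(\sqrt{\log(p)/n})/c.$$

This completes the proof. □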
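The rate in (D1) can be illustrated by a small simulation. The sketch below is an assumption-laden toy example (uniform bounded covariates, a quadratic truth, and all names are hypothetical): the error of the projected linear model is uncorrelated with every covariate but not independent of them, and the ratio of $\max_j |\varepsilon^T X_j/n|$ to $\sqrt{\log(p)/n}$ should stay of constant order as p grows.

```python
import numpy as np

rng = np.random.default_rng(2)


def max_cross_term(n, p):
    """max_j |eps^T X_j / n| for one simulated data set."""
    X = rng.uniform(-1.0, 1.0, size=(n, p))          # bounded covariates
    # Truth is X_1^2 plus noise; by symmetry its projection onto the covariates
    # is zero, so the centered response is the error of the projected linear model.
    eps = X[:, 0] ** 2 - 1.0 / 3.0 + rng.standard_normal(n)
    return np.max(np.abs(eps @ X / n))


n = 400
ratios = {}
for p in (50, 500, 5000):
    stat = np.mean([max_cross_term(n, p) for _ in range(20)])
    ratios[p] = stat / np.sqrt(np.log(p) / n)        # should remain bounded in p
```

The dictionary `ratios` holds the averaged ratio for each dimension; values of similar size across p are what (D1) predicts.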


Lemma 2. For random X, assume (A1) and (A2).

1. Then, for $\lambda_X = D_2\sqrt{\log(p)/n}$ with $D_2$ sufficiently large, (A3) and (A4) imply (D2).

2. If $E|\varepsilon^{(0)}|^2 \le C < \infty$ for some constant $C > 0$ (the latter is implied by (A1)), then for $\lambda = D_1\sqrt{\log(p)/n}$ with $D_1$ sufficiently large, (A5) implies (D3).


Proof: The first and second statement can be proved analogously. For the first one, due to (A3), the error when regressing $X_j$ versus $X_{-j} = \{X_k;\ k \neq j\}$ is bounded.

When invoking the $\ell_0$-sparsity assumptions (A4,b) or (A5,b), respectively, we know that the compatibility condition holds with probability tending to one: because of (A1), (A2) and the $\ell_0$-sparsity assumption (cf. Bühlmann and van de Geer, 2011, Ch. 6.12). Therefore, and using Lemma 1, we obtain the statements invoking some oracle inequality for the Lasso (cf. Bühlmann and van de Geer, 2011, Th. 6.1) or the square root Lasso (van de Geer, 2014, Th. 1.4.2).

When invoking the $\ell_r$-sparsity ($0 < r < 1$) assumptions (A4,a) or (A5,a), respectively, we can use the results from van de Geer (2015, Sec. 5), which apply not only for the square root Lasso but also for the Lasso (cf. van de Geer, 2014, Th. 1.3.2). We need to argue that the compatibility condition holds with probability tending to one for, e.g. when proving the second statement, the set:


$$S_* = \{j;\ |\beta_j^0| > C\sqrt{\log(p)/n}/\Lambda(S_0)\}.$$

Due to the assumption on $\ell_r$-sparsity and due to the assumption that $\Lambda(S_0)$ is bounded, we have that $|S_*| = o(n/\log(p))$. Therefore, due to (A1) and (A2), the compatibility condition holds for $S_*$ with probability tending to one (cf. Bühlmann and van de Geer, 2011, Ch. 6.12). □


7.1.2 Proof 

Denote by $Z_j^0 = X_j - X_{-j}\gamma_j^0$, analogously as in Section ??, but now for $n \times 1$ vectors. We first analyze the behavior of the part $\varepsilon^T Z_j^0/n$. We have that

$$E[\varepsilon_i X_{k;i}] = 0 \quad \forall k,$$

and hence $E[(Z_j^0)^T \varepsilon] = 0$.

Proposition 7. Assume (A1), (A3), (A6) (only that $\omega_{p;jj} > 0$) and (A7). Denote by $\omega_{p;jj} = E[(\varepsilon_i Z^0_{j;i})^2]$. Then:

$$\frac{\varepsilon^T Z_j^0/\sqrt{n}}{\sqrt{\omega_{p;jj}}} \Rightarrow \mathcal{N}(0,1) \quad (n \to \infty).$$

Note that $p = p_n$ is allowed to depend on $n$.

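Proof. Denote by $W_{p;i} = \varepsilon_i Z^0_{j;i}$. Since $\mathrm{Cov}(\varepsilon_i, X_{k;i}) = 0\ \forall k$, we have that $E[W_{p;i}] = 0$. Furthermore, $W_{p;1}, \ldots, W_{p;n}$ are independent. We verify the Lindeberg condition. For $\kappa > 0$,

$$\lim_{n \to \infty} \frac{1}{\omega_{p;jj}} \int_{|W_p| > \kappa\sqrt{n\,\omega_{p;jj}}} W_p^2\, dP = 0.$$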


Assuming (A7,a), we invoke the dominated convergence theorem: 

s W s PWf'l" < vWl 

Because $I(|W_p| > \kappa\sqrt{n\,\omega_{p;jj}}) \to 0\ (n \to \infty)$ in probability, and hence

$$|W_p|^2\, I(|W_p| > \kappa\sqrt{n\,\omega_{p;jj}}) = o_P(1),$$

and because of the dominated convergence theorem we conclude that the Lindeberg condition holds.
Assuming (A7,b), we have that $E|W_{p;i}|^{2+\delta} \le E|\varepsilon_i|^{2+\delta}\, C_3^{2+\delta} < \infty$. The Lindeberg condition is then implied by the Lyapunov theorem. □

Proposition 8. (with $Z_j$ instead of $Z_j^0$)

Assume (A1), (A3), (A6), (A7), (D1) and (D2). Then:

$$\frac{\varepsilon^T Z_j/\sqrt{n}}{\sqrt{\omega_{p;jj}}} \Rightarrow \mathcal{N}(0,1) \quad (n \to \infty).$$

Proof. We only need to control the difference $\varepsilon^T(Z_j - Z_j^0)/n$. We have that

$$|\varepsilon^T(Z_j - Z_j^0)/n| \le \max_{k \neq j} |\varepsilon^T X_k/n|\ \|\hat\gamma_j - \gamma_j^0\|_1.$$

By (D1) and (D2), the right-hand side is $O_P(\sqrt{\log(p)/n})\, o_P(1/\sqrt{\log(p)}) = o_P(n^{-1/2})$. The statement then follows from Proposition 7. □

Proposition 9. Assume (A2), (A3), (A6), (A7), (D1), (D2) and (D3). Then:

$$\sqrt{n}\, \frac{Z_j^T X_j/n}{\sqrt{\omega_{p;jj}}}\, (\hat b_j - \beta_j^0) \Rightarrow \mathcal{N}(0,1) \quad (n \to \infty).$$

Proof. The statement follows by standard arguments as in van de Geer et al. (2014), requiring (D3), and using Proposition 8. For the case with the square root Lasso in (??), the proof is analogous.

One can easily show that $\|Z_j\|_2/\sqrt{n} = \sqrt{E|Z_j^{(0)}|^2} + o_P(1)$, due to (A2), (A3), and (D2), and $E|Z_j^{(0)}|^2$ is upper bounded by (A3). □

The results from Section 7.1.1 together with Proposition 9 establish the result from Theorem ??. □

7.2 Proof of Proposition

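We write

$$n^{-1}\sum_{i=1}^n (\hat\varepsilon_i Z_{j;i})^2 = n^{-1}\sum_{i=1}^n \big(\varepsilon_i + (\hat\varepsilon_i - \varepsilon_i)\big)^2 \big(Z^0_{j;i} + (Z_{j;i} - Z^0_{j;i})\big)^2.$$

We then get

$$n^{-1}\sum_{i=1}^n (\hat\varepsilon_i Z_{j;i})^2 = n^{-1}\sum_{i=1}^n (\varepsilon_i Z^0_{j;i})^2 + \Delta.$$

One can easily show that $\Delta = o_P(1)$ by using Hölder's inequality (for $\ell_1$-$\ell_\infty$, and Cauchy-Schwarz for $\ell_2$-$\ell_2$) and invoking the following:

$\max_i |Z^0_{j;i}| \le C_3 < \infty$ due to (A3),

$\max_i |Z_{j;i} - Z^0_{j;i}| \le \max_{i,k} |X_{k;i}|\ \|\hat\gamma_j - \gamma_j^0\|_1 = o_P(1)$ due to (A2) and (D2),

$\|\hat\varepsilon - \varepsilon\|_2^2/n = \|X(\hat\beta - \beta^0)\|_2^2/n = o_P(1)$ due to (A2), $\|\beta^0\|_1 = o(\sqrt{n/\log(p)})$ and (A7),

where the last bound follows from e.g. Bühlmann and van de Geer (2011, Cor. 6.1). Therefore,

$$n^{-1}\sum_{i=1}^n (\hat\varepsilon_i Z_{j;i})^2 = E[(\varepsilon^{(0)} Z_j^{(0)})^2] + o_P(1).$$

Furthermore, and simpler to obtain:

$$n^{-1}\sum_{i=1}^n \hat\varepsilon_i Z_{j;i} = E[\varepsilon^{(0)} Z_j^{(0)}] + o_P(1) = o_P(1).$$

Due to (A6), the latter two displayed formulae complete the proof. □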


7.3 Proof of Proposition 

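For statement 1, consider:

$$\sum_{j=1}^p |\beta_j^0|^r = \sum_{j=1}^p \Big|\sum_{\ell=1}^p (\Sigma^{-1})_{j\ell}\Gamma_\ell\Big|^r \le \sum_{j=1}^p \sum_{\ell=1}^p |(\Sigma^{-1})_{j\ell}|^r |\Gamma_\ell|^r = \sum_{\ell=1}^p \|(\Sigma^{-1})_{\cdot\ell}\|_r^r\, |\Gamma_\ell|^r \le \max_\ell \|(\Sigma^{-1})_{\cdot\ell}\|_r^r\, \|\Gamma\|_r^r.$$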

Furthermore, we have that max^ ||(S“^),£||(( < (max^ + 1)||E“^||)(^ and therefore statement 1 is 
complete. 

Regarding statement 2, we use the following argument. Every point £ € Sr can lead to at most 
+ 1 non-zero values of the components of /3°, due to formula ([^. Hence we obtain both bounds 
for ||/3 °||q. □ 








7.4 Proof of Corollary 

The bound above for ||r||Q follows by a similar argument as for statement 2. in Proposition]^ every 
support point in Sfo exhibits a dependence with at most ^max X-variables: therefore there are at 
most 6 max|*S'/o| non-zero covariances between f^{X) and the X-variables. □ 

7.5 Proof of Proposition 

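It is well known that, up to the positive normalizing factor $1/E|Z_j^{(0)}|^2$,

$$\beta_j^0 = E[Z_j^{(0)} f^0(X^{(0)})] = E[Z_j^{(0)} f^0(X^{(0)}_{S_{f^0}})].$$

Furthermore, since $Z_j^{(0)}$ is the residual when projecting $X_j^{(0)}$ onto $X_{-j}^{(0)} = \{X_k^{(0)};\ k \neq j\}$ and due to the Gaussian assumption: $Z_j^{(0)}$ is independent of $\{X_k^{(0)};\ k \neq j\}$.

Therefore, if $j \notin S_{f^0}$, $Z_j^{(0)}$ is independent also of $f^0(X^{(0)}_{S_{f^0}})$ and therefore, using the representation for $\beta_j^0$ above (again up to the normalizing factor): $\beta_j^0 = E[Z_j^{(0)}]\, E[f^0(X^{(0)}_{S_{f^0}})] = 0$, saying that $j \notin S_0$. This proves the claim. □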


7.6 Proof of Proposition 

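As mentioned already in the proof of Proposition ??, we know that $Z_j^{(0)}$ is independent of $\{X_k^{(0)};\ k \neq j\}$. Therefore, for $j \in S_k$ and up to the positive normalizing factor $1/E|Z_j^{(0)}|^2$:

$$\beta_j^0 = E[Z_j^{(0)} f^0(X^{(0)})] = E\big[Z_j^{(0)} \big(f_1^0(X_1^{(0)}) + \ldots + f_p^0(X_p^{(0)})\big)\big] = E[Z_j^{(0)} f_j^0(X_j^{(0)})].$$

This means that we can obtain $\beta_j^0$ from projecting $f_j^0(X_j^{(0)})$ onto $\{X_\ell^{(0)};\ \ell = 1, \ldots, p\}$:

$$\gamma = \mathrm{argmin}_{\beta \in \mathbb{R}^p}\, E|f_j^0(X_j^{(0)}) - (X^{(0)})^T\beta|^2, \qquad (16)$$

and $\beta_j^0 = \gamma_j$. But we know from Proposition ?? that for the support of $\gamma$:

$$S(\gamma) = \{\ell;\ \gamma_\ell \neq 0\} \subseteq S_k.$$

Therefore, we can restrict the projection in (16) to the variables from $S_k$:

$$\gamma = \mathrm{argmin}_{\beta \in \mathbb{R}^{|S_k|}}\, E|f_j^0(X_j^{(0)}) - (X_{S_k}^{(0)})^T\beta|^2,$$

and $\beta_j^0 = \gamma_{c(j)}$, where $c(j)$ is the index of the component in $\gamma$ which corresponds to variable $X_j^{(0)}$. This completes the proof. □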


7.7 Proof of Proposition 

We write

$$\|Z_j\|_2^2/n = \|Z_j^0\|_2^2/n + \|X_{-j}(\hat\gamma_j - \gamma_j^0)\|_2^2/n + \Delta, \quad |\Delta| \le 2\,\|Z_j^0\|_2/\sqrt{n}\ \|X_{-j}(\hat\gamma_j - \gamma_j^0)\|_2/\sqrt{n}. \qquad (17)$$

Due to (C1,i) we have that


■^?|| 2 /^ ^ Ct l‘l with probability tending to one. 


(18) 


We can also establish, analogous to Bühlmann and van de Geer (2011, Cor. 6.1) invoking (C1,iii), but now controlling $\max_{k \neq j} |(Z_j^0)^T X_k|/n = O_P(\sqrt{\log(p)/n})$ (see Lemma 1, and using (C1,i) and (C1,ii)):

$$\|X_{-j}(\hat\gamma_j - \gamma_j^0)\|_2^2/n = o_P(1). \qquad (19)$$

By (17), (18) and (19) we complete the proof. □